Windows Azure : Exploring Full-Text Search (part 3)

10/22/2010 6:08:53 PM

3.6. Stemming

You are almost ready to start indexing your literary classics. But first, you must add some stemming code. As you learned earlier, any nontrivial amount of text contains several variants of the same words: plural forms, different tenses, and so on. Users will want to search for any of these forms. (Imagine not finding a result on Google because you mistakenly searched for a plural.)

To work around this issue, you store and search the “stemmed” form of every word. Every word is converted into a root form and, since all variants share the same root, finding the root will satisfy searches for all variants. For example, the root for Connect, Connection, Connected, Connecting, and Connections is the same: Connect. By storing only the root stemmed term, users can search successfully for any of the variants.

There are several algorithms to do just this. Let’s use the Porter Stemming Algorithm. Specifically, let’s use the C# implementation of this algorithm, available at http://tartarus.org/~martin/PorterStemmer/csharp2.txt.

Note: The implementation of this algorithm is based on Martin Porter’s 1980 paper, “An algorithm for suffix stripping.” You can find the original paper at http://tartarus.org/~martin/PorterStemmer/def.txt.

This file requires a little modification before you can use it in this project. Remove the following code from the top and add it to the project as Stemmer.cs:

using System.Windows.Forms;

[assembly: AssemblyTitle("")]
[assembly: AssemblyDescription("Porter stemmer in CSharp")]
[assembly: AssemblyConfiguration("")]
[assembly: AssemblyCompany("")]
[assembly: AssemblyProduct("")]
[assembly: AssemblyCopyright("")]
[assembly: AssemblyTrademark("")]
[assembly: AssemblyCulture("")]
[assembly: AssemblyVersion("1.4")]
[assembly: AssemblyKeyFile("keyfile.snk")]
[assembly: AssemblyDelaySign(false)]
[assembly: AssemblyKeyName("")]

Note: You don’t need the attributes, since Visual Studio already specifies them in AssemblyInfo.cs automatically for you. The WinForms reference is also removed, since this is a command-line interface (CLI) application.

The key method in the file is stemTerm. You’ll be calling this to get the stemmed version of every word you index.

3.7. Indexing

You are now ready to index your content. To get the files, you asked the user in the Main method to enter index followed by the directory path containing the files to index. Let’s add some code to construct a Document object out of every file in the directory, and pass it off to an Indexer class for indexing. You give every document a unique GUID to ensure that it has a unique identifier in the document table. (Remember that the document ID is also used as the partition key.)

Add the following code in Program.cs in the Program class, and replace the blank Index method with this implementation:

      static void Index(string path)
        {
            foreach (var file in Directory.GetFiles(path))
            {
                string title = Path.GetFileName(file);
                string content = File.ReadAllText(file);
                Indexer.AddDocument(
                new Document( title, Guid.NewGuid().ToString()), content);

            }
        }

The core indexing magic will happen in the Indexer class, which you’ll see in a moment.

The algorithm for indexing a document is as follows:

You add the Document object to the Document table in Azure storage by adding it to an instance of FTSDataServiceContext.
You strip out all punctuation, carriage returns, line feeds, and undesirable characters in the document content. You use a simple regular expression that looks for the ASCII ranges for punctuation and special characters, as well as .NET’s regular expression support, to do this.
You split the content into words. Since you removed all other special characters in the previous step, you can easily word-break on spaces.
You construct a .NET Dictionary object, and add every new stemmed term you see to it. You then loop over all words and, when you come across a new word, check whether it is already present in the dictionary. If it isn’t present already, you add it. At the end of this process, you have a list of unique terms in the content. To stem a term, you call stemTerm in the PorterStemmer class.
You loop through every term collected in step 4 and construct a row for it in the inverted index table, mapping that term to the current document’s ID. You save changes back to the cloud by calling FTSDataServiceContext.SaveChangesWithRetries.

The following code does all of this. It follows the same set of steps as the algorithm specified. Add it as Indexer.cs:

using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Text.RegularExpressions;
using System.Configuration;
using PorterStemmerAlgorithm;
using Microsoft.WindowsAzure.StorageClient;
using Microsoft.WindowsAzure;

namespace FTS
{
    public static class Indexer
    {
        public static void AddDocument(Document doc, string docContent)
        {
            var account =
CloudStorageAccount.Parse(ConfigurationSettings.AppSettings
       ["DataConnectionString"]);
            var ctx =
            new FTSDataServiceContext(account.TableEndpoint.ToString(),
 account.Credentials);

            //Add document to documents table
            ctx.AddObject(FTSDataServiceContext.DocumentTableName, doc);

            //Now let's find all the terms
            var terms = new Dictionary<string,bool>();

            //Normalize document - strip punctuation
            var content =  Regex.Replace(docContent, @"[!-\/:-@\[-\`]", " ");
            //Strip line feeds and carriage returns
            content = content.Replace('\r', ' ');
            content = content.Replace('\n', ' ');
            var possibleTerms = content.Split(' ');

            foreach (var possibleTerm in possibleTerms)
            {

                if (possibleTerm.Trim().Length == 0)
                {
                    continue;
                }
                var stemmer = new PorterStemmer(); //Cannot reuse
                var stemmedTerm = stemmer.stemTerm(possibleTerm.Trim().ToLower());
                terms[stemmedTerm] = true;
            }

            //Loop through terms and add pointer to document
            foreach (var term in terms.Keys)
            {
                ctx.AddObject(FTSDataServiceContext.IndexTableName,
                   new IndexEntry(term, doc.ID));
                ctx.SaveChangesWithRetries();
            }


        }
    }
}

And that’s all the indexing code you need! At this point, run the project.

In the console that pops up type in index followed by the path to your text files, and press Enter. The program should churn away for a while as it creates your inverted indexes, as shown in Figure 2 . The actual time taken depends on the number of files and the size of each file. It should be on the order of several minutes, so treat yourself to a coffee. At the end, you will have a fully constructed set of inverted indexes in the cloud, mapping stemmed terms to the documents in which they appear.

Figure 2. Indexer

3.8. Searching for a single term

Now that you have a fully constructed index, let’s write some searching code.

In Program.cs, replace the blank Search method with the following code to pass off search queries to the helper Searcher class. This helper class will return a List<Document> that you iterate over and display as the results:

static void Search(string query)
        {
            var results = Searcher.Search(query);
            if (results.Count == 0)
            {
                Console.WriteLine("No results found!");
                return;
            }

            foreach (var doc in results)
            {
                Console.WriteLine(doc.Title);
            }

        }

Compared to the indexing code, the search code is much simpler. The first pass at your search code will support searching for only one term. Copy the following code into Searcher.cs and add it to the project:

using System;
using System.Collections.Generic;
using System.Linq;
using System.Configuration;
using System.Text;
using Microsoft.WindowsAzure;
using Microsoft.WindowsAzure.StorageClient;
using Microsoft.Samples.ServiceHosting.StorageClient;
using PorterStemmerAlgorithm;
using System.Threading;

namespace FTS
{
    public class Searcher
    {

public static List<Document> Search(string queryTerms)
        {
            var account =
CloudStorageAccount.Parse(ConfigurationSettings.AppSettings
   ["DataConnectionString"]);
            var ctx = new FTSDataServiceContext(
              account.TableEndpoint.ToString(), account.Credentials);

            //Clean up query terms
            queryTerms = queryTerms.Trim().ToLower();

            var stemmer = new PorterStemmer();
            // At this time, we support only one term in the query. Let's stem it
            var stemmedTerm = stemmer.stemTerm(queryTerms);

            // Look up all inverted index entries for the stemmed term
            var indexQuery = from indexEntry in
ctx.CreateQuery<IndexEntry>(FTSDataServiceContext.IndexTableName)
                        where indexEntry.Term == stemmedTerm
                        select indexEntry;
            var results = new List<Document>();

            // For every inverted index entry in which the stemmed
            // term exists, look up the Document
            // and add to our list of results

            foreach (IndexEntry indexEntry in indexQuery)
            {
                var docQuery = from doc in

                ctx.CreateQuery<Document>
                  (FTSDataServiceContext.DocumentTableName)
                               where doc.ID == indexEntry.DocID
                               select doc;

                results.Add(docQuery.FirstOrDefault());

            }

            // Return our list of results
            return results;
        }    }
}

The code first creates an FTSDataServiceContext with which to perform queries later. At this time, let’s deal only with single search terms—you’ll add support for dealing with multiple search terms soon. Since you have only a single search term, you can directly case-fold into lowercase and stem it.

You then query the inverted index table using the stemmed term and look up all entries that have the stemmed term. This returns a list of document IDs (if the search was successful). Since displaying document IDs in the results is not useful, you look up the Document entity for each of these document IDs so that you can display the title of each of these results.

At this point, you have a fully functional FTS engine that is good to go. Press F5 and run your project. Search for any term that appears in the text you’ve indexed by typing search followed by the term. Since a few classic literary works have been used in this sample, the following sample queries yield good results.

First, let’s try a search that should yield only one result:

>search Dracula
Dracula - Bram Stoker

Now, let’s try a common word across many books that should yield several results:

>search war
Count of Monte Cristo - Alexandre Dumas.txt
Art of War - Sun Tzu.txt
Dracula - Bram Stoker
Fairy Tales - The Brothers Grimm.txt
Around the World in Eighty Days - Jules Verne
On the Origin of Species - Charles Darwin

As you can see, basic search works well. But the key feature in an FTS engine is the ability to deal with spelling variations.

Let’s try various versions of the last search. Try searching for wars or warring. You’ll find that you get the same set of results, even if the variant you typed doesn’t occur verbatim in the text. That’s exactly what you want, because users rarely know the precise form that occurs in the source text or content.

3.9. Searching multiple terms

The preceding code searches for a single term only. But what about multiple terms where you want to do a Boolean AND and show results only where all terms are present?

To do so, you must modify the previous code to search for each individual term, and perform an intersection of all the terms. Doing so in sequence would be slow. To speed things up, you queue up all the requests in parallel using a thread queue, and wait for them to finish using a ManualResetEvent for each request. Therefore, the total time depends only on the slowest request, rather than being the sum of the time taken for all requests.

To perform the intersection, you use .NET 3.5’s built-in HashSet class, as shown in the following code. Replace your earlier implementation of Search with the following one:

  public static List<Document> Search(string queryTerms)
        {
            var account =
CloudStorageAccount.Parse(ConfigurationSettings.AppSettings
                         ["DataConnectionString"]);

            var terms = queryTerms.Contains(' ')?
                    queryTerms.ToLower().Trim().Split(' ')                    : new
string[1]{queryTerms} /* Just one term */;


            var resetEvents = new ManualResetEvent[terms.Length];
            var allResults = new HashSet<Document>[terms.Length];
            for (int i = 0; i < terms.Length; i++)
            {
                resetEvents[i] = new ManualResetEvent(false);
                ThreadPool.QueueUserWorkItem(new WaitCallback((object index) =>
                {
                    var ctx =
                    new FTSDataServiceContext(account.TableEndpoint.ToString(),
                    account.Credentials);


                    var stemmer = new PorterStemmer();
                    var stemmedTerm = stemmer.stemTerm(terms[(int)index]);
                    var indexQuery = from indexEntry in
                                     ctx.CreateQuery<IndexEntry>
                                    (FTSDataServiceContext.IndexTableName)
                                     where indexEntry.Term == stemmedTerm
                                     select indexEntry;
                    var results = new HashSet<Document>();

                    foreach (IndexEntry indexEntry in indexQuery)
                    {
                        var docQuery = from doc in
                        ctx.CreateQuery<Document>
                       (FTSDataServiceContext.DocumentTableName)
                                       where doc.ID == indexEntry.DocID
                                       select doc;

                        results.Add(docQuery.FirstOrDefault());

                    }
                    allResults[(int)index] = results;
                    resetEvents[(int)index].Set();
                }), i);


            }

            //Wait for all parallel queries to finish executing
            WaitHandle.WaitAll(resetEvents);

            // Use HashSet's intersection ability. Set it to the first term's
            //results and intersect with
            // results for all other terms. Though the extension method
            // Intersect works with any IEnumerable,
            // using HashSet gives us the best search performance
            IEnumerable<Document> finalResults =
               (IEnumerable<Document>)allResults[0];
            foreach (var termResults in allResults)
            {
                finalResults = finalResults.Intersect(termResults);
            }

            return finalResults.ToList<Document>();
        }

You can now run the project, search for multiple terms, and look for results that contain all terms. Here’s a search query where each term exists in multiple books, but the combination exists in only one:

>search france dantes Marseilles
Count of Monte Cristo - Alexandre Dumas.txt

3.10. Ideas for improvement

This discussion has barely scratched the tip of the proverbial iceberg. Information retrieval is a vast field, and there is a lot of existing research and code on indexing, stemming, word breaking, ranking, and various other aspects of a search.

There is also room for improvement in how you use Windows Azure tables. For example, this examination did not include snippets of the results and where they appear inside each book. Indexing is quite slow in this current implementation, because you make a round trip to the server per term, and it will need to speed up before it can be used for interactive data entry, as opposed to a background batch job.

Depending on your scenario, there is a lot of scope for improvement on the ideas presented so far.